A Classification of Schema Mappings and Analysis of Mapping Tools
Abstract
Schema mapping techniques for data exchange have become popular and useful tools in both research and industry. A schema mapping relates a source schema to a target schema via correspondences, which are specified by a domain expert, possibly supported by automated schema matching algorithms. The set of correspondences, i.e., the mapping, is interpreted as a data transformation, usually expressed as a query. These queries transform data from the source schema to conform to the target schema. They can be used to materialize data at the target or as views in a virtually integrated system. We present a classification of mapping situations that can occur when mapping between two relational or nested (XML) schemata. Our classification takes into consideration 1:1 and n:m correspondences, attribute-level and higher-level mappings, and special constructs, such as choice constraints, cardinality constraints, and data types. Based on this classification, we have developed a general suite of schemata, data, and correspondences to test the ability of tools to cope with the different mapping situations. We evaluated several commercial and research tools that support the definition of schema mappings and interpret these mappings as data transformations. We found that no tool performs well in all mapping situations and that many tools produce incorrect data transformations. The test suite can serve as a benchmark for future improvements and developments of schema mapping tools.

1 Schema Mappings, Data Exchange, and Mapping Tools

The problem of information integration, i.e., enabling access to multiple distributed, autonomous, and heterogeneous data sources through a common interface (common schema, common query language), is prominent in database research and information systems development. Apart from technical challenges, the problem of schematic heterogeneity is particularly obtrusive.
Even when using a common data model, there are many different ways of modeling the same real-world entities and relationships. In this article we focus on the technique of schema mapping to describe the relationships between two schemata in the context of data transformation. Schema mappings and their use for data exchange have recently become a popular notion, as exemplified in [AL05, FKP05, MBHR05] and many other projects and tools. Schema mappings have a wide array of further uses, such as schema translation [SL90] and schema integration [BLN86], which are not considered in this article.

Consider the simple example of Fig. 1. It shows two nested schemata, both modeling persons and their membership in teams. In the first schema, the membership of a person in a team is modeled as a foreign key constraint; in the second, it is modeled by nesting person elements under team elements. Also observe that different facts are represented in the two schemata. The first schema models the address of a person while the second does not. Conversely, the second schema models the person’s date of birth (DOB) while the first does not. Even when elements with the same meaning are included, their labels need not coincide. For instance, teamURL and website have the same semantics but different labels.

Figure 1: A mapping between two nested schemata

Given these and many more heterogeneities among schemata, schema mapping describes the general technique of relating elements between two heterogeneous schemata. The relationships are based on some semantic similarity of the elements and are usually called correspondences. Graphically, correspondences are modeled as arrows from an element in a source schema to an element in a target schema; conceptually, a correspondence states that data for the target element can be obtained by fetching the data stored at the corresponding source element.
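As a rough illustration of how such arrows might be read as a transformation, the following Python sketch nests persons under the team that their foreign key references and renames teamURL to website. Only teamURL, website, address, and DOB come from the example of Fig. 1; the labels name, teamRef, and teamID, the sample values, and the helper functions are assumptions made for this sketch, and it does not reproduce the output of any particular mapping tool.

```python
from collections import defaultdict

# Hypothetical source instance: persons reference their team via a foreign key.
SOURCE = {
    "person": [
        {"name": "Ann", "address": "1 Main St", "teamRef": "t1"},
        {"name": "Bob", "address": "2 Oak Ave", "teamRef": "t1"},
        {"name": "Cy", "address": "3 Elm Rd", "teamRef": "t2"},
    ],
    "team": [
        {"teamID": "t1", "teamURL": "http://example.org/a"},
        {"teamID": "t2", "teamURL": "http://example.org/b"},
    ],
}

# Correspondences ("arrows") as source label -> target label.
TEAM_ARROWS = {"teamURL": "website"}
PERSON_ARROWS = {"name": "name"}  # assumed arrow, not taken from Fig. 1


def apply_arrows(record, arrows):
    """Copy each mapped source value to its target label; unmapped values are dropped."""
    return {target: record[source] for source, target in arrows.items()}


def transform(source):
    """Preserve the foreign-key association by nesting persons under their team."""
    members = defaultdict(list)
    for person in source["person"]:
        members[person["teamRef"]].append(apply_arrows(person, PERSON_ARROWS))
    return [
        {**apply_arrows(team, TEAM_ARROWS), "person": members.get(team["teamID"], [])}
        for team in source["team"]
    ]
```

In this sketch the address values disappear because no arrow leaves them, and the target carries no DOB because no arrow reaches it.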
Formally, a correspondence is a relation between a set of elements in the source schema and a set of elements in the target schema. This relation is identified by a transformation function that transforms source data into target data or by a filter that selects elements of the source schema [Leg05]. This definition also allows m:n correspondences. A schema mapping is a set of correspondences between a source schema and a target schema.

Schema mappings are used to transform data stored under the source schema so that it conforms to the target schema. In a typical situation, the source schema might be that of a data source and the target schema the federated schema of an integrated system. In a data exchange scenario, the two schemata are those of peers willing to exchange data. Figure 1 shows a schema mapping that relates attributes of the left-hand schema, the source schema, to those of the right-hand schema, the target schema.

This situation raises two important questions that have been dealt with extensively in recent literature and in recent products: (i) how can one obtain the correspondences between two schemata, and (ii) how can one interpret a set of correspondences to actually transform source data so that it conforms to the target schema? In this article we focus on the second question and assume that the correspondences have been established.

Informally, the interpretation of a schema mapping faces several difficulties:

• Source schema interpretation: Associations of data elements (relations, nesting, foreign keys) in the source schema should be recognized and preserved during transformation.
• Target schema conformance: The result of a transformation should conform to the target schema.
• User intention: The expert users merely provide informal correspondences (“arrows”) between the schemata. Even under the requirement that an interpretation of such a mapping cover all correspondences, there usually remain several alternatives.
Correctly guessing the intended one can rely only on suitable heuristics.

In this article we classify these difficulties and evaluate several schema mapping tools in their ability to overcome them. Surprisingly, we observe that many of the tools do not adequately address even the first two difficulties.

We limit the scope of our evaluation to the relational and XML data models because of their widespread usage and because they are supported by most schema mapping tools. Hence, we also limit the number of query languages that are used to transform data from the source to the target. For simplicity, in this article we restrict ourselves to a graphical notation for schemata and mappings. Even though we consider only the relational and XML data models in this article, the basis for our research is a schema definition language that is independent of the data model. Both the relational data model and the XML data model can be transformed into this schema definition language. The schema definition language, the transformation of the relational and the XML model into it, and the formal definition of each mapping situation can be found in [Leg05].

2 A Classification of Mapping Situations

Figure 2 shows our classification of schema mapping situations, distinguishing three main classes: mapping situations related to missing correspondences, displayed in the left subtree; mapping situations related to single correspondences, in the middle; and mapping situations related to multiple correspondences, on the right. The following sections introduce each of the classes and outline possible ways for mapping tools to interpret them.

A classification similar to ours was presented by Kim et al., who classify conflicts between relational schemata based on their structure [KS91]. We used this classification as a basis for ours and extended it to include non-relational features.
Previous research on schematic heterogeneity yielded several classifications of correspondences and mapping conflicts. Batini et al., for example, distinguish four types of semantic relationships: identical, equivalent, compatible, and incompatible [BLN86]. Every non-identical relationship implies a conflict. Their classification requires semantic knowledge about the participating schemata, i.e., users must know exactly which real-world object is modeled by which schema element. Our classification, on the other hand, uses only the structural information given by the schemata and the mapping between them.

Figure 2: Classification of schema mapping situations

Spaccapietra et al. classify conflicts between schemata into semantic conflicts, descriptive conflicts, heterogeneity conflicts, and structural conflicts [SPD92]. Because our classification assumes a given schema mapping, we regard both semantic conflicts and descriptive conflicts as resolved. The two remaining classes contain the conflicts that are interesting for our purpose but were not sufficiently analyzed.

2.1 Missing correspondences

This class contains mapping situations that occur if leaf nodes or inner nodes of a schema are not part of the schema mapping, i.e., no correspondence is connected to these nodes. If leaf nodes of the source schema are not part of the mapping, the only important consideration is a loss of information, meaning that the source data cannot be recreated from the target data after the transformation. In contrast, if leaf nodes of the target schema are left out of the mapping, constraints of the target schema could be violated:

• If the node is part of a key or unique constraint (not null), a mapping tool could automatically generate a value, as Popa et al. demonstrate [PVM02]. Another possible solution is to reject the mapping and ask the user for manual resolution.
• If the node is part of a foreign-key constraint, a tool could automatically detect a connection between the key and the foreign key using the information from the source schema. If that is not possible, manual resolution is necessary.
• If the node is mandatory and a default value is given, a tool should use this value. Otherwise, manual resolution would be the best option. A random value is also an acceptable but not optimal solution: the fact that the value is required implies that it is relevant for the application, and inserting a random value contradicts this intuition.
• If the node is not mandatory, a tool can ignore it, preferably with a warning.

Whether missing correspondences to and from inner nodes of a schema tree influence the produced transformation query depends on the definitions of correspondences and mappings. Problems occur depending on whether a tool allows and interprets correspondences between inner nodes (see Sec. 2.2).

2.2 Single correspondences

Mapping situations related to single correspondences are outlined in this section. The classification is based on the properties of correspondences along different dimensions:

1. Cardinality of the correspondence. According to the number of participating schema elements in a correspondence, 1:1, 1:n, n:1, and n:m correspondences can be identified. Only 1:1 correspondences are further analyzed in this section. The remaining three types are analogous to problems at the inner-node level: the questions of how to combine multiple input values and how to produce multiple output values are similar to the questions discussed for multiple correspondences (see Sec. 2.3).

2. Type of the participating schema elements. Elements associated with a correspondence can be either leaf nodes or inner nodes.

• Correspondences between leaf nodes are the most common kind of correspondence and are supported by every schema mapping tool.
Their purpose is to define which source data to transform into which target data.
• Correspondences between inner nodes are not supported by all mapping tools. If they are supported, then for every element in the source instance an element in the target instance could be created.
• When inner nodes are connected with leaf nodes, the data and metadata levels are mixed. If no additional transformation function is given, a mapping tool should reject this correspondence. Figure 3 shows an example of metadata in the source schema corresponding to data in the target schema.

3. Properties of the participating schema elements and constraints. Correspondences can also be classified according to the properties of the participating schema elements (see bottom left of Fig. 2).

Figure 3: Data vs. metadata

There are several cardinality-related situations, which can be classified as situations with a loss of information (when source data cannot be transferred into the target schema due to a higher cardinality in the source schema) and situations with a lack of information (when the source instance does not include enough information to build a valid target schema instance due to a lower cardinality in the source schema). Both situations should be recognized by a mapping tool and treated accordingly. For example, if in a 1:1 correspondence the source node is nullable and the target node is mandatory with no default value, a mapping tool should ask for manual resolution for cases where the source value is null.

Conversion tables or cast tables define how data types can be transformed into each other (also across different data models). They can be used to resolve mapping situations related to different data types. According to these tables there are three classes of compatibility: compatible, partly compatible (depending on the concrete value), and incompatible data types.
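A conversion table with these three classes could be sketched in Python as follows. The type names, the table entries, and the functions lookup and convert are assumptions chosen for illustration; real cast tables depend on the data models and the tool involved.

```python
from enum import Enum


class Compat(Enum):
    COMPATIBLE = "compatible"
    PARTLY = "partly compatible"  # depends on the concrete value
    INCOMPATIBLE = "incompatible"


# Hypothetical cast table: (source type, target type) -> (class, cast function).
CAST_TABLE = {
    ("integer", "string"): (Compat.COMPATIBLE, str),
    ("integer", "decimal"): (Compat.COMPATIBLE, float),
    ("string", "integer"): (Compat.PARTLY, int),
}


def lookup(source_type, target_type):
    """Return the compatibility class and cast; pairs not listed are incompatible."""
    if source_type == target_type:
        return Compat.COMPATIBLE, lambda value: value
    return CAST_TABLE.get((source_type, target_type), (Compat.INCOMPATIBLE, None))


def convert(value, source_type, target_type):
    """Cast a value, failing loudly for incompatible types or unsuitable values."""
    compat, cast = lookup(source_type, target_type)
    if compat is Compat.INCOMPATIBLE:
        raise TypeError(f"no cast from {source_type} to {target_type}")
    try:
        return cast(value)
    except ValueError as exc:
        # Partly compatible pair whose concrete value cannot be cast.
        raise ValueError(f"{value!r} is not a valid {target_type}") from exc
```

With this table, convert("17", "string", "integer") succeeds while convert("abc", "string", "integer") raises an error, which is the value-dependent behavior that the partly compatible class describes.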
For compatible data types, mapping tools could insert casts according to these conversion tables, and they could warn users if data types are incompatible or only partly compatible.

2.3 Multiple correspondences

This class comprises mapping situations related to multiple correspondences. The categories in this class are based on the structure of the source and target schemata rather than on the properties of individual elements. Research projects addressing this topic developed sophisticated algorithms to discover clusters of semantically connected schema elements (e.g., [MHH00, PVM02]). To make the tests and results easier to comprehend and compare, we chose a simplified approach. According to this approach, three different kinds of relationships (associations) between elements of a schema can be distinguished:

• Elements have a structural connection if they have at least one common ancestor whose cardinality is greater than one. A mapping tool should recognize structurally connected elements in a source schema and maintain their semantic relationship. This can be done by transforming structurally connected elements simultaneously, e.g., leaf nodes with the same parent node. Furthermore, a mapping tool should be able to combine nested elements by un-nesting them where appropriate. Regarding structurally connected elements in the target schema, a tool has the choice of whether to group the data or not. This question also leads to the problem of whether an aggregation of the data is possible. Figure 4 shows an example where a grouping of families under the same lastname and an aggregation of the familyincome is desirable.

Figure 4: Grouping and aggregation

• A foreign-key-based connection between elements is given if the elements reside in subtrees that are interconnected by a foreign-key relationship. This class also includes explicitly defined joins between subtrees, which can be captured with some mapping tools.
A mapping tool should recognize source schema elements that are connected by a foreign key and maintain their semantic relationship. Hence, a transformation query should contain an (outer) join over the connected subtrees. For target schema elements with a foreign-key connection, two situations can be distinguished: either the key and foreign-key elements are part of the mapping or they are not. If they are part of the mapping, the user already provides the information on how to establish a semantic relationship between the target values. If they are not part of the mapping, a schema mapping tool could automatically create key/foreign-key pairs for related elements. This can be realized with Skolem functions (see [PVM02]).
• Nodes that are connected neither structurally nor by foreign keys are connectionless nodes. Connectionless in this context means that they can be arbitrarily combined in a transformation from the source to the target. Connectionless nodes could, for example, be combined with a Cartesian product or an outer join, or be listed in the order of their appearance in the instance of the source schema.

Additionally, we classified mapping situations that result from choice constraints. A choice constraint specifies that, in an instance, a node may have only one child node from the set of nodes listed in the schema. A schema mapping tool should treat source schema nodes with a choice constraint similarly to nodes that are not mandatory. Unless all child nodes of a source node with a choice constraint are associated with the same target schema node, it cannot be guaranteed that a value is transferred into the target. The problem is then a lack of information, similar to the cardinality-related situations where the cardinality of the element in the source schema is lower than the cardinality of the element in the target schema.
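The condition for source nodes with a choice constraint can be checked mechanically. The Python sketch below assumes that a mapping is given as a dictionary of 1:1 correspondences from source labels to target labels; the function name and the phone/email example are hypothetical.

```python
def choice_guarantees_value(choice_children, correspondences):
    """True only if every alternative of a source choice maps to one common target node."""
    targets = [correspondences.get(child) for child in choice_children]
    # An unmapped alternative (None) or differing targets means a value may be missing.
    return None not in targets and len(set(targets)) == 1


# Hypothetical source node whose instance holds either a phone or an email child.
alternatives = ["phone", "email"]
same_target = {"phone": "contactInfo", "email": "contactInfo"}
split_targets = {"phone": "phoneNumber", "email": "emailAddress"}
```

Here choice_guarantees_value(alternatives, same_target) holds, whereas split_targets leaves either phoneNumber or emailAddress without a value in every instance.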
A schema mapping tool should recognize these situations and either reject the mapping or inform the user about the problem. For a target schema node with a choice constraint, it must be guaranteed that exactly one child node is produced. Accordingly, a mapping tool should recognize situations where no child node or more than one child node could be created and either reject those situations or warn the user.

A characteristic that is independent of any particular mapping situation is the production of duplicates. A tool can produce either duplicate values or distinct values. Even though both solutions are correct, it would be desirable to give the user the choice of whether a tool should produce duplicates. For brevity, we anticipate the results for this aspect of schema mapping: all tools except one produce duplicate values by default. Only two tools give the user the choice of whether to produce duplicates. One of them provides a special operator, whereas the other requires the manual insertion of script code.

3 Schema Mapping Tools

In this section we briefly describe the six schema mapping tools from research and industry that we have evaluated. All tools feature a graphical user interface that displays the source and target schemata as trees and allows users to draw lines between elements of the schemata. The evaluation of their capability to interpret schema mappings in the next section is anonymized, so this section serves only as a pointer to the different tools and a comparison of their general capabilities.

BizTalk Mapper 2004 is part of Microsoft’s BizTalk Server for managing business processes (http://www.microsoft.com/biztalk/). Messages and data from different sources are converted to XML, and BizTalk Mapper allows users to specify transformations on this data. In addition to the pure mapping functionality, BizTalk Mapper offers a large library of “functoids” to transform data values. The mappings are interpreted as XSLT scripts.
Clio is not a commercially available tool but a research prototype developed at IBM’s Almaden Research Center (http://www.almaden.ibm.com/software/km/clio/). It was one of the first and most sophisticated tools, motivating many theoretical and practical research results. The product version of Clio is now part of Rational Data Architect.

MapForce 2005 is part of Altova’s XML suite of tools to help users develop XML applications, which also includes XMLSpy (http://www.altova.com/products/mapforce/). Like Clio, MapForce is a tool developed solely for the purpose of schema mapping and deriving transformation queries.

Oracle Warehouse Builder 10g Release 1 is a tool for developing data warehouses based on the Oracle 10g database system (http://www.oracle.com/technology/products/warehouse/). Part of this development is the ETL process (Extract, Transform, Load), which in turn includes a schema mapping step, among many others. We discuss this tool as a representative of the many ETL tools offered by database and data warehouse vendors and by companies specializing in ETL.

Stylus Studio 6 is an XML integrated development environment by Progress Software specializing in XQuery/XSLT creation and visualization (www.stylusstudio.com). With its XQuery Mapper, developers can create and visualize mappings between XML schemata, which are in turn interpreted as XQuery or XSLT transformation queries.

WebSphere Studio Application Developer Integration Edition is IBM’s integrated development environment, including a collection of tools to develop Web Services, Portals, etc. (http://www.ibm.com/software/integration/wsadie/). One feature is a tool to map between XML Schemata, namely the XML-to-XML Mapping Editor.

Schema mapping is a research and development field of growing importance. Apart from the tools described here, there are several others, including most ETL tools, the schema management prototype Rondo [MRB03], the SAP Exchange Infrastructure, etc.
We also note that the tools are still under development or are research prototypes that do not aim at full coverage of all possible situations. We have noted the versions we used and are aware that many of the observed problems are in fact bugs and are likely to be corrected in later versions. However, other problems reach deeper into the semantics of mapping, and there is no easy or obvious “fix”. We believe that the set of experiments described in the next section is well suited not only to document the state of the art but also to track progress in the tools’ development.